Abstract

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.


1 Introduction 


Most algorithms for policy optimization can be classified into three broad categories: (1) policy iteration methods, which alternate between estimating the value function under the current policy and improving the policy (Bertsekas, 2005); (2) policy gradient methods, which use an estimator of the gradient of the expected return (total reward) obtained from sample trajectories (Peters & Schaal, 2008a) (and which, as we later discuss, have a close connection to policy iteration); and (3) derivative-free optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation (CMA), which treat the return as a black box function to be optimized in terms of the policy parameters (Szita & Lörincz, 2006).


General derivative-free stochastic optimization methods such as CEM and CMA are preferred on many problems, because they achieve good results while being simple to understand and implement. For example, while Tetris is a classic benchmark problem for approximate dynamic programming (ADP) methods, stochastic optimization methods are difficult to beat on this task (Gabillon et al., 2013). For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations (Wampler & Popović, 2009). The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods (Nemirovski, 2005). Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.

Proceedings of the 31st International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37.

In this article, we first prove that maximizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes. Then we make a series of approximations to the theoretically-justified algorithm, yielding a practical algorithm, which we call trust region policy optimization (TRPO). We describe two variants of this algorithm: first, the single-path method, which can be applied in the model-free setting; second, the vine method, which requires the system to be restored to particular states, which is typically only possible in simulation. These algorithms are scalable and can optimize nonlinear policies with tens of thousands of parameters, which have previously posed a major challenge for model-free policy search (Deisenroth et al., 2013). In our experiments, we show that the same TRPO methods can learn complex policies for swimming, hopping, and walking, as well as playing Atari games directly from raw images.


2 Preliminaries 


Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $r : \mathcal{S} \to \mathbb{R}$ is the reward function, $\rho_0 : \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.

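Let $\pi$ denote a stochastic policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, and let $\eta(\pi)$ denote its expected discounted reward:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right], \quad \text{where } s_0 \sim \rho_0(s_0),\ a_t \sim \pi(a_t|s_t),\ s_{t+1} \sim P(s_{t+1}|s_t, a_t).$$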


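We will use the following standard definitions of the state-action value function $Q_\pi$, the value function $V_\pi$, and the advantage function $A_\pi$:

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right],$$

$$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right],$$

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s), \quad \text{where } a_t \sim \pi(a_t|s_t),\ s_{t+1} \sim P(s_{t+1}|s_t, a_t) \text{ for } t \ge 0.$$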


The following useful identity expresses the expected return of another policy $\tilde\pi$ in terms of the advantage over $\pi$, accumulated over timesteps (see Kakade & Langford (2002) or Appendix A for proof):

$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde\pi}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right] \qquad (1)$$

where the notation $\mathbb{E}_{s_0, a_0, \ldots \sim \tilde\pi}[\ldots]$ indicates that actions are sampled $a_t \sim \tilde\pi(\cdot|s_t)$. Let $\rho_\pi$ be the (unnormalized) discounted visitation frequencies


$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \ldots,$$

where $s_0 \sim \rho_0$ and the actions are chosen according to $\pi$. We can rewrite Equation (1) with a sum over states instead of timesteps:

$$\begin{aligned}
\eta(\tilde\pi) &= \eta(\pi) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \tilde\pi) \sum_a \tilde\pi(a|s)\, \gamma^t A_\pi(s, a) \\
&= \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde\pi) \sum_a \tilde\pi(a|s) A_\pi(s, a) \\
&= \eta(\pi) + \sum_s \rho_{\tilde\pi}(s) \sum_a \tilde\pi(a|s) A_\pi(s, a). \qquad (2)
\end{aligned}$$

This equation implies that any policy update $\pi \to \tilde\pi$ that has a nonnegative expected advantage at every state $s$, i.e., $\sum_a \tilde\pi(a|s) A_\pi(s, a) \ge 0$, is guaranteed to increase the policy performance $\eta$, or leave it constant in the case that the expected advantage is zero everywhere. This implies the classic result that the update performed by exact policy iteration, which uses the deterministic policy $\tilde\pi(s) = \arg\max_a A_\pi(s, a)$, improves the policy if there is at least one state-action pair with a positive advantage value and nonzero state visitation probability, otherwise the algorithm has converged to the optimal policy. However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states $s$ for which the expected advantage is negative, that is, $\sum_a \tilde\pi(a|s) A_\pi(s, a) < 0$. The complex dependency of $\rho_{\tilde\pi}(s)$ on $\tilde\pi$ makes Equation (2) difficult to optimize directly. Instead, we introduce the following local approximation to $\eta$:

$$L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde\pi(a|s) A_\pi(s, a). \qquad (3)$$

Note that $L_\pi$ uses the visitation frequency $\rho_\pi$ rather than $\rho_{\tilde\pi}$, ignoring changes in state visitation density due to changes in the policy. However, if we have a parameterized policy $\pi_\theta$, where $\pi_\theta(a|s)$ is a differentiable function of the parameter vector $\theta$, then $L_\pi$ matches $\eta$ to first order (see Kakade & Langford (2002)). That is, for any parameter value $\theta_0$,

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}. \qquad (4)$$

Equation (4) implies that a sufficiently small step $\pi_{\theta_0} \to \tilde\pi$ that improves $L_{\pi_{\theta_{old}}}$ will also improve $\eta$, but does not give us any guidance on how big of a step to take.

To address this issue, Kakade & Langford (2002) proposed a policy updating scheme called conservative policy iteration, for which they could provide explicit lower bounds on the improvement of $\eta$. To define the conservative policy iteration update, let $\pi_{old}$ denote the current policy, and let $\pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi')$. The new policy $\pi_{new}$ was defined to be the following mixture:

$$\pi_{new}(a|s) = (1 - \alpha)\pi_{old}(a|s) + \alpha \pi'(a|s). \qquad (5)$$

Kakade and Langford proved the following result for this update:

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1 - \gamma(1 - \alpha))(1 - \gamma)}\alpha^2, \quad \text{where } \epsilon = \max_s \left|\mathbb{E}_{a \sim \pi'(a|s)}\left[A_\pi(s, a)\right]\right|. \qquad (6)$$

Since $\alpha, \gamma \in [0, 1]$, Equation (6) implies the following simpler bound, which we refer to in the next section:

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1 - \gamma)^2}\alpha^2. \qquad (7)$$

The simpler bound is only slightly weaker when $\alpha \ll 1$, which is typically the case in the conservative policy iteration method of Kakade & Langford (2002). Note, however, that so far this bound only applies to mixture policies generated by Equation (5). This policy class is unwieldy and restrictive in practice, and it is desirable for a practical policy update scheme to be applicable to all general stochastic policy classes.
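As an illustration of the quantities defined in this section, the following minimal sketch evaluates them exactly on a tiny tabular MDP and checks the identity in Equation (2) numerically. The MDP (two states, two actions, the transition table `P`, rewards `R`, and both policies) is entirely hypothetical and chosen only so the code is self-contained; it is not taken from the paper's experiments.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
GAMMA = 0.9
N_S, N_A = 2, 2
P = [  # P[s][a][s2]: probability of moving from s to s2 under action a
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
R = [0.0, 1.0]      # r(s): the reward depends on the state only
RHO0 = [1.0, 0.0]   # initial state distribution


def state_transition(policy):
    """State-to-state transition matrix under policy[s][a]."""
    return [[sum(policy[s][a] * P[s][a][s2] for a in range(N_A))
             for s2 in range(N_S)] for s in range(N_S)]


def value(policy, iters=2000):
    """V_pi, obtained by iterating V <- r + gamma * P_pi V to convergence."""
    T = state_transition(policy)
    V = [0.0] * N_S
    for _ in range(iters):
        V = [R[s] + GAMMA * sum(T[s][s2] * V[s2] for s2 in range(N_S))
             for s in range(N_S)]
    return V


def advantage(policy):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s)."""
    V = value(policy)
    return [[R[s] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_S)) - V[s]
             for a in range(N_A)] for s in range(N_S)]


def eta(policy):
    """Expected discounted reward eta(pi) = sum_s rho_0(s) V_pi(s)."""
    V = value(policy)
    return sum(RHO0[s] * V[s] for s in range(N_S))


def visitation(policy, iters=2000):
    """Unnormalized discounted visitation frequencies rho_pi(s)."""
    T = state_transition(policy)
    d = list(RHO0)              # distribution of s_t, starting at t = 0
    rho = [0.0] * N_S
    discount = 1.0
    for _ in range(iters):
        rho = [rho[s] + discount * d[s] for s in range(N_S)]
        d = [sum(d[s] * T[s][s2] for s in range(N_S)) for s2 in range(N_S)]
        discount *= GAMMA
    return rho


def expected_advantage_sum(rho, new_policy, A):
    """sum_s rho(s) sum_a new_policy(a|s) A(s, a)."""
    return sum(rho[s] * sum(new_policy[s][a] * A[s][a] for a in range(N_A))
               for s in range(N_S))


pi = [[0.5, 0.5], [0.5, 0.5]]        # current policy
pi_new = [[0.3, 0.7], [0.2, 0.8]]    # candidate policy
A = advantage(pi)
# Equation (2): uses the visitation frequencies of the NEW policy.
exact = eta(pi) + expected_advantage_sum(visitation(pi_new), pi_new, A)
# Equation (3): the local approximation uses the visitation frequencies of the OLD policy.
surrogate = eta(pi) + expected_advantage_sum(visitation(pi), pi_new, A)
```

In this sketch `exact` agrees with `eta(pi_new)` up to numerical error, while `surrogate` generally differs from it because the state visitation frequencies are held fixed at those of `pi`.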
3 Monotonic Improvement Guarantee for General Stochastic Policies

Equation (6), which applies to conservative policy iteration, implies that a policy update that improves the right-hand side is guaranteed to improve the true performance $\eta$. Our principal theoretical result is that the policy improvement bound in Equation (6) can be extended to general stochastic policies, rather than just mixture policies, by replacing $\alpha$ with a distance measure between $\pi$ and $\tilde\pi$. Since mixture policies are rarely used in practice, this result is crucial for extending the improvement guarantee to practical problems. The particular distance measure we use is the total variation divergence, which is defined by $D_{TV}(p \,\|\, q) = \frac{1}{2}\sum_i |p_i - q_i|$ for discrete probability distributions $p, q$.* Define $D_{TV}^{\max}(\pi, \tilde\pi)$ as

$$D_{TV}^{\max}(\pi, \tilde\pi) = \max_s D_{TV}(\pi(\cdot|s) \,\|\, \tilde\pi(\cdot|s)). \qquad (8)$$

Theorem 1. Let $\alpha = D_{TV}^{\max}(\pi_{old}, \pi_{new})$. Then Equation (7) holds.

We provide two proofs in the appendix. The first proof extends Kakade and Langford's result using the fact that the random variables from two distributions with total variation divergence less than $\alpha$ can be coupled, so that they are equal with probability $1 - \alpha$. The second proof uses perturbation theory to prove a slightly stronger version of Equation (7), with a more favorable definition of $\epsilon$ that depends on $\tilde\pi$.


Next, we note the following relationship between the total variation divergence and the KL divergence (Pollard (2000), Ch. 3): $D_{TV}(p \,\|\, q)^2 \le D_{KL}(p \,\|\, q)$. Let $D_{KL}^{\max}(\pi, \tilde\pi) = \max_s D_{KL}(\pi(\cdot|s) \,\|\, \tilde\pi(\cdot|s))$. The following bound then follows directly from Equation (7):

$$\eta(\tilde\pi) \ge L_\pi(\tilde\pi) - C\, D_{KL}^{\max}(\pi, \tilde\pi), \quad \text{where } C = \frac{2\epsilon\gamma}{(1 - \gamma)^2}. \qquad (9)$$

Algorithm 1 describes an approximate policy iteration scheme based on the policy improvement bound in Equation (9). Note that for now, we assume exact evaluation of the advantage values $A_\pi$. Algorithm 1 uses a constant $\epsilon' \ge \epsilon$ that is simpler to describe in terms of measurable quantities.
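The two divergences used above are easy to compute for discrete distributions. The following minimal sketch, with hypothetical helper names and made-up example distributions, evaluates both and checks the stated relationship $D_{TV}(p \,\|\, q)^2 \le D_{KL}(p \,\|\, q)$ on one example; it illustrates the definitions and is not a proof.

```python
import math


def tv_divergence(p, q):
    """Total variation divergence: 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def max_over_states(divergence, policy_a, policy_b):
    """Maximum over states of a divergence between per-state action distributions."""
    return max(divergence(p, q) for p, q in zip(policy_a, policy_b))


# Illustrative action distributions (assumed values, not from the paper).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
tv = tv_divergence(p, q)
kl = kl_divergence(p, q)
```

For these example values `tv` is 0.1, so `tv ** 2` is 0.01, which is below `kl` (roughly 0.025), in agreement with the inequality.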

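It follows from Equation (9) that Algorithm 1 is guaranteed to generate a monotonically improving sequence of policies $\eta(\pi_0) \le \eta(\pi_1) \le \eta(\pi_2) \le \ldots$. To see this, let $M_i(\pi) = L_{\pi_i}(\pi) - C\, D_{KL}^{\max}(\pi_i, \pi)$. Then

$$\begin{aligned}
\eta(\pi_{i+1}) &\ge M_i(\pi_{i+1}) \quad \text{by Equation (9)}, \\
\eta(\pi_i) &= M_i(\pi_i), \quad \text{therefore,} \\
\eta(\pi_{i+1}) - \eta(\pi_i) &\ge M_i(\pi_{i+1}) - M_i(\pi_i). \qquad (10)
\end{aligned}$$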


*Our result is straightforward to extend to continuous states 
and actions by replacing the sums with integrals. 


Algorithm 1 Approximate policy iteration algorithm guaranteeing non-decreasing expected return $\eta$
Initialize $\pi_0$.
for $i = 0, 1, 2, \ldots$ until convergence do
    Compute all advantage values $A_{\pi_i}(s, a)$.
    Solve the constrained optimization problem
        $\pi_{i+1} = \arg\max_\pi \left[ L_{\pi_i}(\pi) - \frac{2\epsilon'\gamma}{(1 - \gamma)^2} D_{KL}^{\max}(\pi_i, \pi) \right]$
        where $\epsilon' = \max_s \max_a |A_\pi(s, a)|$
        and $L_{\pi_i}(\pi) = \eta(\pi_i) + \sum_s \rho_{\pi_i}(s) \sum_a \pi(a|s) A_{\pi_i}(s, a)$
end for


Thus, by maximizing $M_i$ at each iteration, we guarantee that the true objective $\eta$ is non-decreasing. This algorithm is a type of minorization-maximization (MM) algorithm (Hunter & Lange, 2004), which is a class of methods that also includes expectation maximization. In the terminology of MM algorithms, $M_i$ is the surrogate function that minorizes $\eta$ with equality at $\pi_i$. This algorithm is also reminiscent of proximal gradient methods and mirror descent.

Trust region policy optimization, which we propose in the following section, is an approximation to Algorithm 1, which uses a constraint on the KL divergence rather than a penalty to robustly allow large updates.


4 Optimization of Parameterized Policies 

In the previous section, we considered the policy optimization problem independently of the parameterization of $\pi$ and under the assumption that the policy can be evaluated at all states. We now describe how to derive a practical algorithm from these theoretical foundations, under finite sample counts and arbitrary parameterizations.

Since we consider parameterized policies $\pi_\theta(a|s)$ with parameter vector $\theta$, we will overload our previous notation to use functions of $\theta$ rather than $\pi$, e.g. $\eta(\theta) := \eta(\pi_\theta)$, $L_\theta(\tilde\theta) := L_{\pi_\theta}(\pi_{\tilde\theta})$, and $D_{KL}(\theta \,\|\, \tilde\theta) := D_{KL}(\pi_\theta \,\|\, \pi_{\tilde\theta})$. We will use $\theta_{old}$ to denote the previous policy parameters that we want to improve upon.

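The preceding section showed that $\eta(\theta) \ge L_{\theta_{old}}(\theta) - C\, D_{KL}^{\max}(\theta_{old}, \theta)$, with equality at $\theta = \theta_{old}$. Thus, by performing the following maximization, we are guaranteed to improve the true objective $\eta$:

$$\underset{\theta}{\text{maximize}} \ \left[ L_{\theta_{old}}(\theta) - C\, D_{KL}^{\max}(\theta_{old}, \theta) \right].$$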


In practice, if we used the penalty coefficient $C$ recommended by the theory above, the step sizes would be very small. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:

$$\begin{aligned}
&\underset{\theta}{\text{maximize}} \ L_{\theta_{old}}(\theta) \qquad (11) \\
&\text{subject to } D_{KL}^{\max}(\theta_{old}, \theta) \le \delta.
\end{aligned}$$
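To see why the penalty coefficient from the theory leads to very small steps, it helps to plug numbers into $C = 2\epsilon\gamma/(1-\gamma)^2$ from Equation (9). The values $\gamma = 0.99$ and $\epsilon = 1$ below are assumptions chosen only for illustration.

```latex
C \;=\; \frac{2\epsilon\gamma}{(1-\gamma)^2}
  \;=\; \frac{2 \cdot 1 \cdot 0.99}{(0.01)^2}
  \;=\; 19800,
\qquad
C \cdot D_{KL}^{\max} \;=\; 19800 \cdot 0.01 \;=\; 198
\quad \text{for } D_{KL}^{\max} = 0.01 .
```

Under these assumed values, the penalized objective only prefers a step with a maximum KL divergence of 0.01 if the surrogate $L_{\theta_{old}}$ improves by more than 198, which is very large relative to an advantage scale of $\epsilon = 1$. A constraint with a directly chosen $\delta$ avoids tying the step size to this coefficient.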


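This problem imposes a constraint that the KL divergence is bounded at every point in the state space. While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence:

$$\bar{D}_{KL}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\left[D_{KL}(\pi_{\theta_1}(\cdot|s) \,\|\, \pi_{\theta_2}(\cdot|s))\right].$$

We therefore propose solving the following optimization problem to generate a policy update:

$$\begin{aligned}
&\underset{\theta}{\text{maximize}} \ L_{\theta_{old}}(\theta) \qquad (12) \\
&\text{subject to } \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta.
\end{aligned}$$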

Similar policy updates have been proposed in prior work (Bagnell & Schneider, 2003; Peters & Schaal, 2008b; Peters et al., 2010), and we compare our approach to prior methods in Section 7 and in the experiments in Section 8. Our experiments also show that this type of constrained update has similar empirical performance to the maximum KL divergence constraint in Equation (11).


5 Sample-Based Estimation of the Objective 
and Constraint 


The previous section proposed a constrained optimization problem on the policy parameters (Equation (12)), which optimizes an estimate of the expected total reward $\eta$ subject to a constraint on the change in the policy at each update. This section describes how the objective and constraint functions can be approximated using Monte Carlo simulation.



Figure 1. Left: illustration of single path procedure. Here, we generate a set of trajectories via simulation of the policy and incorporate all state-action pairs $(s_n, a_n)$ into the objective. Right: illustration of vine procedure. We generate a set of "trunk" trajectories, and then generate "branch" rollouts from a subset of the reached states. For each of these states $s_n$, we perform multiple actions ($a_1$ and $a_2$ here) and perform a rollout after each action, using common random numbers (CRN) to reduce the variance.


We seek to solve the following optimization problem, obtained by expanding $L_{\theta_{old}}$ in Equation (12):

$$\begin{aligned}
&\underset{\theta}{\text{maximize}} \ \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a|s) A_{\theta_{old}}(s, a) \\
&\text{subject to } \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta. \qquad (13)
\end{aligned}$$

We first replace $\sum_s \rho_{\theta_{old}}(s)[\ldots]$ in the objective by the expectation $\frac{1}{1-\gamma}\mathbb{E}_{s \sim \rho_{\theta_{old}}}[\ldots]$. Next, we replace the advantage values $A_{\theta_{old}}$ by the Q-values $Q_{\theta_{old}}$ in Equation (13), which only changes the objective by a constant. Last, we replace the sum over the actions by an importance sampling estimator. Using $q$ to denote the sampling distribution, the contribution of a single $s_n$ to the loss function is

$$\sum_a \pi_\theta(a|s_n) A_{\theta_{old}}(s_n, a) = \mathbb{E}_{a \sim q}\left[\frac{\pi_\theta(a|s_n)}{q(a|s_n)} A_{\theta_{old}}(s_n, a)\right].$$

Our optimization problem in Equation (13) is exactly equivalent to the following one, written in terms of expectations:

$$\begin{aligned}
&\underset{\theta}{\text{maximize}} \ \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)} Q_{\theta_{old}}(s, a)\right] \qquad (14) \\
&\text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{KL}(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s))\right] \le \delta.
\end{aligned}$$

All that remains is to replace the expectations by sample averages and replace the Q value by an empirical estimate. The following sections describe two different schemes for performing this estimation.

The first sampling scheme, which we call single path, is the one that is typically used for policy gradient estimation (Bartlett & Baxter, 2011), and is based on sampling individual trajectories. The second scheme, which we call vine, involves constructing a rollout set and then performing multiple actions from each state in the rollout set. This method has mostly been explored in the context of policy iteration methods (Lagoudakis & Parr, 2003; Gabillon et al., 2013).
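Replacing the expectation in the importance-sampled objective with a sample average can be sketched as follows. The tuple layout of `samples`, the function name, and the lookup-table policies are assumptions made for this illustration only; in practice the new policy would be a differentiable function of $\theta$.

```python
def surrogate_estimate(samples, new_prob):
    """Sample average of pi_theta(a|s) / q(a|s) * Q_estimate(s, a).

    samples:  list of (state, action, sampling_prob, q_estimate) tuples, where
              sampling_prob is q(a|s) for the action that was actually taken
    new_prob: function (state, action) -> pi_theta(a|s) for the candidate policy
    """
    total = 0.0
    for state, action, sampling_prob, q_estimate in samples:
        weight = new_prob(state, action) / sampling_prob   # importance weight
        total += weight * q_estimate
    return total / len(samples)


# Hypothetical tabular policies over two states and two actions.
old_table = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.75}
new_table = {(0, 0): 0.25, (0, 1): 0.75, (1, 0): 0.5, (1, 1): 0.5}

# Actions were sampled from the old policy, so q(a|s) equals the old probability.
samples = [(0, 0, 0.5, 1.0), (0, 1, 0.5, 3.0), (1, 0, 0.25, 2.0)]

at_old_policy = surrogate_estimate(samples, lambda s, a: old_table[(s, a)])
at_new_policy = surrogate_estimate(samples, lambda s, a: new_table[(s, a)])
```

When the candidate policy equals the sampling policy, every importance weight is one and the estimate reduces to the plain average of the Q-value estimates.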


5.1 Single Path 

In this estimation procedure, we collect a sequence of states by sampling $s_0 \sim \rho_0$ and then simulating the policy $\pi_{\theta_{old}}$ for some number of timesteps to generate a trajectory $s_0, a_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T$. Hence, $q(a|s) = \pi_{\theta_{old}}(a|s)$. $Q_{\theta_{old}}(s, a)$ is computed at each state-action pair $(s_t, a_t)$ by taking the discounted sum of future rewards along the trajectory.
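The discounted sum of future rewards along a trajectory can be computed in a single backward pass. The following sketch uses a hypothetical function name and follows the convention of Section 2, in which the sum at time $t$ starts with the reward at $s_t$.

```python
def discounted_returns(rewards, gamma):
    """For each t, return sum_{l >= 0} gamma**l * rewards[t + l].

    rewards[t] is the reward observed at timestep t of one sampled trajectory.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each entry reuses the discounted sum of everything after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


q_estimates = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

For the three unit rewards above and a discount of 0.5, the estimates are 1.75, 1.5, and 1.0. Note that a finite trajectory truncates the infinite sum, so the estimates near the end of the trajectory omit later rewards.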

5.2 Vine 

In this estimation procedure, we first sample $s_0 \sim \rho_0$ and simulate the policy $\pi_{\theta_i}$ to generate a number of trajectories. We then choose a subset of $N$ states along these trajectories, denoted $s_1, s_2, \ldots, s_N$, which we call the "rollout set". For each state $s_n$ in the rollout set, we sample $K$ actions according to $a_{n,k} \sim q(\cdot|s_n)$. Any choice of $q(\cdot|s_n)$ with a support that includes the support of $\pi_{\theta_i}(\cdot|s_n)$ will produce a consistent estimator. In practice, we found that $q(\cdot|s_n) = \pi_{\theta_i}(\cdot|s_n)$ works well on continuous problems, such as robotic locomotion, while the uniform distribution works well on discrete tasks, such as the Atari games, where it can sometimes achieve better exploration.

For each action $a_{n,k}$ sampled at each state $s_n$, we estimate $\hat{Q}_{\theta_i}(s_n, a_{n,k})$ by performing a rollout (i.e., a short trajectory) starting with state $s_n$ and action $a_{n,k}$. We can greatly reduce the variance of the Q-value differences between rollouts by using the same random number sequence for the noise in each of the $K$ rollouts, i.e., common random numbers. See (Bertsekas, 2005) for additional discussion on Monte Carlo estimation of Q-values and (Ng & Jordan, 2000) for a discussion of common random numbers in reinforcement learning.
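The effect of common random numbers can be illustrated with a toy simulator. Everything in the sketch below is an assumption made for illustration: the reward model (a deterministic action effect plus additive Gaussian noise), the horizon, the discount, and the function names. Because the noise is purely additive here, reusing the seed cancels it exactly; in a real simulator the cancellation would only be partial.

```python
import random
import statistics


def rollout_return(action, seed, horizon=20, gamma=0.95):
    """Discounted return of one rollout in a toy noisy simulator."""
    rng = random.Random(seed)      # the seed fixes the whole noise sequence
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        reward = 0.1 * action + rng.gauss(0.0, 1.0)
        total += discount * reward
        discount *= gamma
    return total


def q_difference_samples(n, common):
    """Samples of Q(s, a=1) - Q(s, a=0), with or without common random numbers."""
    diffs = []
    for i in range(n):
        seed_a = i
        seed_b = i if common else 10_000 + i   # independent noise when not common
        diffs.append(rollout_return(1, seed_a) - rollout_return(0, seed_b))
    return diffs


var_crn = statistics.pvariance(q_difference_samples(200, common=True))
var_indep = statistics.pvariance(q_difference_samples(200, common=False))
```

In this toy setting `var_crn` is zero up to floating-point rounding, whereas `var_indep` reflects the noise of two independent rollouts.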


In small, finite action spaces, we can generate a rollout for every possible action from a given state. The contribution to $L_{\theta_{old}}$ from a single state $s_n$ is as follows:

$$L_n(\theta) = \sum_{k=1}^{K} \pi_\theta(a_k|s_n) \hat{Q}(s_n, a_k), \qquad (15)$$

where the action space is $\mathcal{A} = \{a_1, a_2, \ldots, a_K\}$. In large or continuous state spaces, we can construct an estimator of the surrogate objective using importance sampling. The self-normalized estimator (Owen (2013), Chapter 8) of $L_{\theta_{old}}$ obtained at a single state $s_n$ is

$$L_n(\theta) = \frac{\sum_{k=1}^{K} \frac{\pi_\theta(a_{n,k}|s_n)}{\pi_{\theta_{old}}(a_{n,k}|s_n)} \hat{Q}(s_n, a_{n,k})}{\sum_{k=1}^{K} \frac{\pi_\theta(a_{n,k}|s_n)}{\pi_{\theta_{old}}(a_{n,k}|s_n)}}, \qquad (16)$$

assuming that we performed $K$ actions $a_{n,1}, a_{n,2}, \ldots, a_{n,K}$ from state $s_n$. This self-normalized estimator removes the need to use a baseline for the Q-values (note that the gradient is unchanged by adding a constant to the Q-values). Averaging over $s_n \sim \rho(\pi)$, we obtain an estimator for $L_{\theta_{old}}$, as well as its gradient.

The vine and single path methods are illustrated in Figure 1. We use the term vine, since the trajectories used for sampling can be likened to the stems of vines, which branch at various points (the rollout set) into several short offshoots (the rollout trajectories).

The benefit of the vine method over the single path method is that our local estimate of the objective has much lower variance given the same number of Q-value samples in the surrogate objective. That is, the vine method gives much better estimates of the advantage values. The downside of the vine method is that we must perform far more calls to the simulator for each of these advantage estimates. Furthermore, the vine method requires us to generate multiple trajectories from each state in the rollout set, which limits this algorithm to settings where the system can be reset to an arbitrary state. In contrast, the single path algorithm requires no state resets and can be directly implemented on a physical system (Peters & Schaal, 2008b).


6 Practical Algorithm


Here we present two practical policy optimization algorithms based on the ideas above, which use either the single path or vine sampling scheme from the preceding section. The algorithms repeatedly perform the following steps:

1. Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.

2. By averaging over samples, construct the estimated objective and constraint in Equation (14).

3. Approximately solve this constrained optimization problem to update the policy's parameter vector $\theta$. We use the conjugate gradient algorithm followed by a line search, which is altogether only slightly more expensive than computing the gradient itself. See the appendix for details.

With regard to (3), we construct the Fisher information matrix (FIM) by analytically computing the Hessian of the KL divergence, rather than using the covariance matrix of the gradients. That is, we estimate $A_{ij}$ as $\frac{1}{N}\sum_{n=1}^{N} \frac{\partial^2}{\partial\theta_i \partial\theta_j} D_{KL}(\pi_{\theta_{old}}(\cdot|s_n) \,\|\, \pi_\theta(\cdot|s_n))$, rather than $\frac{1}{N}\sum_{n=1}^{N} \frac{\partial}{\partial\theta_i} \log \pi_\theta(a_n|s_n) \frac{\partial}{\partial\theta_j} \log \pi_\theta(a_n|s_n)$. The analytic estimator integrates over the action at each state $s_n$, and does not depend on the action $a_n$ that was sampled. As described in the appendix, this analytic estimator has computational benefits in the large-scale setting, since it removes the need to store a dense Hessian or all policy gradients from a batch of trajectories. The rate of improvement in the policy is similar to the empirical FIM, as shown in the experiments.
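The conjugate gradient algorithm mentioned in step 3 only needs products of the matrix with a vector, which is what makes it possible to avoid storing a dense Hessian. The following is a generic textbook version written for this illustration, not the authors' implementation; the small explicit matrix stands in for a Fisher-vector product routine, and all names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A.

    matvec(v) must return the product A v; the matrix A is never formed here.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x = 0 initially
    p = list(b)              # current search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:     # residual is already small enough
            break
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Stand-in for a Fisher-vector product: an explicit 2x2 SPD matrix (assumed values).
A = [[4.0, 1.0], [1.0, 3.0]]


def fisher_vector_product(v):
    return [dot(row, v) for row in A]


gradient = [1.0, 2.0]
step_direction = conjugate_gradient(fisher_vector_product, gradient)
```

For this two-dimensional example the solver converges in two iterations to the solution of $A x = g$, namely $(1/11, 7/11)$. In a large-scale setting one would cap `iters` well below the number of parameters and accept an approximate solution.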

Let us briefly summarize the relationship between the theory from Section 3 and the practical algorithm we have described:

• The theory justifies optimizing a surrogate objective with a penalty on KL divergence. However, the large penalty coefficient $C$ leads to prohibitively small steps, so we would like to decrease this coefficient. Empirically, it is hard to robustly choose the penalty coefficient, so we use a hard constraint instead of a penalty, with parameter $\delta$ (the bound on KL divergence).

• The constraint on $D_{KL}^{\max}(\theta_{old}, \theta)$ is hard for numerical optimization and estimation, so instead we constrain $\bar{D}_{KL}(\theta_{old}, \theta)$.

• Our theory ignores estimation error for the advantage function. Kakade & Langford (2002) consider this error in their derivation, and the same arguments would hold in the setting of this paper, but we omit them for simplicity.


7 Connections with Prior Work 


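As mentioned in Section 4, our derivation results in a policy update that is related to several prior methods, providing a unifying perspective on a number of policy update schemes. The natural policy gradient (Kakade, 2002) can be obtained as a special case of the update in Equation (12) by using a linear approximation to $L$ and a quadratic approximation to the $\bar{D}_{KL}$ constraint, resulting in the following problem:

$$\begin{aligned}
&\underset{\theta}{\text{maximize}} \ \left[\nabla_\theta L_{\theta_{old}}(\theta)\big|_{\theta = \theta_{old}} \cdot (\theta - \theta_{old})\right] \qquad (17) \\
&\text{subject to } \tfrac{1}{2}(\theta_{old} - \theta)^T A(\theta_{old})(\theta_{old} - \theta) \le \delta, \\
&\text{where } A(\theta_{old})_{ij} = \frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j} \mathbb{E}_{s \sim \rho_\pi}\left[D_{KL}(\pi(\cdot|s, \theta_{old}) \,\|\, \pi(\cdot|s, \theta))\right]\Big|_{\theta = \theta_{old}}.
\end{aligned}$$

The update is $\theta_{new} = \theta_{old} + \frac{1}{\lambda} A(\theta_{old})^{-1} \nabla_\theta L(\theta)\big|_{\theta = \theta_{old}}$, where the stepsize $\frac{1}{\lambda}$ is typically treated as an algorithm parameter. This differs from our approach, which enforces the constraint at each update. Though this difference might seem subtle, our experiments demonstrate that it significantly improves the algorithm's performance on larger problems.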
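To make the contrast between a fixed stepsize and an enforced constraint concrete, the following standard Lagrangian calculation (added here for illustration, with $g$ denoting the gradient of $L$ at $\theta_{old}$ and $A$ assumed symmetric positive definite) shows which stepsize makes the quadratic constraint hold with equality.

```latex
\begin{aligned}
\mathcal{L}(\theta, \lambda)
  &= g^{T}(\theta - \theta_{old})
     - \lambda\left(\tfrac{1}{2}(\theta - \theta_{old})^{T} A\,(\theta - \theta_{old}) - \delta\right), \\
\nabla_\theta \mathcal{L} = 0
  &\;\Longrightarrow\; \theta - \theta_{old} = \tfrac{1}{\lambda} A^{-1} g, \\
\tfrac{1}{2}\cdot\tfrac{1}{\lambda^{2}}\, g^{T} A^{-1} g = \delta
  &\;\Longrightarrow\; \tfrac{1}{\lambda} = \sqrt{\frac{2\delta}{g^{T} A^{-1} g}} .
\end{aligned}
```

Under this reading, a method that fixes $\frac{1}{\lambda}$ in advance takes steps whose size in the KL metric varies with $g^{T} A^{-1} g$, whereas choosing $\frac{1}{\lambda}$ from $\delta$ keeps the quadratic approximation of the divergence at the prescribed level at every update.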


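We can also obtain the standard policy gradient update by using an $\ell_2$ constraint or penalty:

$$\begin{aligned}
&\underset{\theta}{\text{maximize}} \ \left[\nabla_\theta L_{\theta_{old}}(\theta)\big|_{\theta = \theta_{old}} \cdot (\theta - \theta_{old})\right] \qquad (18) \\
&\text{subject to } \tfrac{1}{2}\|\theta - \theta_{old}\|^2 \le \delta.
\end{aligned}$$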



The policy iteration update can also be obtained by solving the unconstrained problem $\text{maximize}_\pi \, L_{\pi_{old}}(\pi)$, using $L$ as defined in Equation (3).

Several other methods employ an update similar to Equation (12). Relative entropy policy search (REPS) (Peters et al., 2010) constrains the state-action marginals $p(s, a)$, while TRPO constrains the conditionals $p(a|s)$. Unlike REPS, our approach does not require a costly nonlinear optimization in the inner loop. Levine and Abbeel (2014) also use a KL divergence constraint, but its purpose is to encourage the policy not to stray from regions where the estimated dynamics model is valid, while we do not attempt to estimate the system dynamics explicitly. Pirotta et al. (2013) also build on and generalize Kakade and Langford's results, and they derive different algorithms from the ones here.

Figure 2. 2D robot models used for locomotion experiments. From left to right: swimmer, hopper, walker. The hopper and walker present a particular challenge, due to underactuation and contact discontinuities.

8 Experiments

We designed our experiments to investigate the following questions:

1. What are the performance characteristics of the single path and vine sampling procedures?

2. TRPO is related to prior methods (e.g. natural policy gradient) but makes several changes, most notably by using a fixed KL divergence rather than a fixed penalty coefficient. How does this affect the performance of the algorithm?

3. Can TRPO be used to solve challenging large-scale problems? How does TRPO compare with other methods when applied to large-scale problems, with regard to final performance, computation time, and sample complexity?

To answer (1) and (2), we compare the performance of the single path and vine variants of TRPO, several ablated variants, and a number of prior policy optimization algorithms. With regard to (3), we show that both the single path and vine algorithms can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem. We also show that these algorithms produce competitive results when learning policies for playing Atari games from images using convolutional neural networks with tens of thousands of parameters.


8.1 Simulated Robotic Locomotion 


We conducted the robotic locomotion experiments using the MuJoCo simulator (Todorov et al., 2012). The three simulated robots are shown in Figure 2. The states of the robots are their generalized positions and velocities, and the controls are joint torques. Underactuation, high dimensionality, and non-smooth dynamics due to contacts make these tasks very challenging. The following models are included in our evaluation:


1. Swimmer. 10-dimensional state space, linear reward for forward progress and a quadratic penalty on joint effort to produce the reward $r(x, u) = v_x - 10^{-5}\|u\|^2$. The swimmer can propel itself forward by making an undulating motion.

2. Hopper. 12-dimensional state space, same reward as the swimmer, with a bonus of +1 for being in a non-terminal state. We ended the episodes when the hopper fell over, which was defined by thresholds on the torso height and angle.

3. Walker. 18-dimensional state space. For the walker, we added a penalty for strong impacts of the feet against the ground to encourage a smooth walk rather than a hopping gait.

Figure 3. Neural networks used for the locomotion task (top) and for playing Atari games (bottom). [Diagram labels: input layer; two convolutional layers with 16 filters each; hidden layer with 20 units; action probabilities.]


We used $\delta = 0.01$ for all experiments. See the table in the Appendix for more details on the experimental setup and parameters used. We used neural networks to represent the policy, with the architecture shown in Figure 3, and further details provided in the Appendix. To establish a standard baseline, we also included the classic cart-pole balancing problem, based on the formulation from Barto et al. (1983), using a linear policy with six parameters that is easy to optimize with derivative-free black-box optimization methods.


The following algorithms were considered in the comparison: single path TRPO; vine TRPO; cross-entropy method (CEM), a gradient-free method (Szita & Lörincz, 2006); covariance matrix adaptation (CMA), another gradient-free method (Hansen & Ostermeier, 1996); natural gradient, the classic natural policy gradient algorithm (Kakade, 2002), which differs from single path by the use of a fixed penalty coefficient (Lagrange multiplier) instead of the KL divergence constraint; empirical FIM, identical to single path, except that the FIM is estimated using the covariance matrix of the gradients rather than the analytic estimate; max KL, which was only tractable on the cart-pole problem, and uses the maximum KL divergence in Equation (11), rather than the average divergence, allowing us to evaluate the quality of this approximation. The parameters used in the experiments are provided in the Appendix. For the natural gradient method, we swept through the possible values of the stepsize in factors of three, and took the best value according to the final performance.

Figure 4. Learning curves for locomotion tasks, averaged across five runs of each algorithm with random initializations. Note that for the hopper and walker, a score of −1 is achievable without any forward velocity, indicating a policy that simply learned balanced standing, but not walking.


Learning curves showing the total reward averaged across five runs of each algorithm are shown in Figure 4. Single path and vine TRPO solved all of the problems, yielding the best solutions. Natural gradient performed well on the two easier problems, but was unable to generate hopping and walking gaits that made forward progress. These results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress, compared to using a fixed penalty. CEM and CMA are derivative-free algorithms, hence their sample complexity scales unfavorably with the number of parameters, and they performed poorly on the larger problems. The max KL method learned somewhat more slowly than our final method, due to the more restrictive form of the constraint, but overall the result suggests that the average KL divergence constraint has a similar effect as the theoretically justified maximum KL divergence. Videos of the policies learned by TRPO may be viewed on the project website: http://sites.google.com/site/trpopaper/


Note that TRPO learned all of the gaits with general-purpose policies and simple reward functions, using minimal prior knowledge. This is in contrast with most prior methods for learning locomotion, which typically rely on hand-architected policy classes that explicitly encode notions of balance and stepping (Tedrake et al., 2004; Geng et al., 2006; Wampler & Popović, 2009).
| | B. Rider | Breakout | Enduro | Pong | Q*bert | Seaquest | S. Invaders |
|---|---|---|---|---|---|---|---|
| Random | 354 | 1.2 | 0 | -20.4 | 157 | 110 | 179 |
| Human (Mnih et al., 2013) | 7456 | 31.0 | 368 | -3.0 | 18900 | 28010 | 3690 |
| Deep Q Learning (Mnih et al., 2013) | 4092 | 168.0 | 470 | 20.0 | 1952 | 1705 | 581 |
| UCC-I (Guo et al., 2014) | 5702 | 380 | 741 | 21 | 20025 | 2995 | 692 |
| TRPO - single path | 1425.2 | 10.8 | 534.6 | 20.9 | 1973.5 | 1908.6 | 568.4 |
| TRPO - vine | 859.5 | 34.2 | 430.8 | 20.9 | 7732.5 | 788.4 | 450.2 |

Table 1. Performance comparison for vision-based RL algorithms on the Atari domain. Our algorithms (bottom rows) were run once on each task, with the same architecture and parameters. Performance varies substantially from run to run (with different random initializations of the policy), but we could not obtain error statistics due to time constraints.


8.2 Playing Games from Images 


To evaluate TRPO on a partially observed task with complex
observations, we trained policies for playing Atari
games, using raw images as input. The games require
learning a variety of behaviors, such as dodging bullets and
hitting balls with paddles. Aside from the high dimensionality,
challenging elements of these games include delayed
rewards (no immediate penalty is incurred when a life is
lost in Breakout or Space Invaders); complex sequences of
behavior (Q*bert requires a character to hop on 21 different
platforms); and non-stationary image statistics (Enduro
involves a changing and flickering background).


We tested our algorithms on the same seven games reported
on in (Mnih et al., 2013) and (Guo et al., 2014), which are
made available through the Arcade Learning Environment
(Bellemare et al., 2013). The images were preprocessed
following the protocol in Mnih et al. (2013), and the policy was
represented by the convolutional neural network shown in
the corresponding figure, with two convolutional layers with
16 channels and stride 2, followed by one fully-connected
layer with 20 units, yielding 33,500 parameters.


The results of the vine and single path algorithms are summarized
in Table 1, which also includes expert human
performance and two recent methods: deep Q-learning
(Mnih et al., 2013), and a combination of Monte-Carlo Tree
Search with supervised training (Guo et al., 2014), called
UCC-I. The 500 iterations of our algorithm took about 30
hours (with slight variation between games) on a 16-core
computer. While our method only outperformed the prior
methods on some of the games, it consistently achieved reasonable
scores. Unlike the prior methods, our approach
was not designed specifically for this task. The ability to
apply the same policy search method to tasks as diverse
as robotic locomotion and image-based game playing
demonstrates the generality of TRPO.


9 Discussion 


We proposed and analyzed trust region methods for optimizing
stochastic control policies. We proved monotonic
improvement for an algorithm that repeatedly optimizes
a local approximation to the expected return of the policy
with a KL divergence penalty, and we showed that an
approximation to this method that incorporates a KL divergence
constraint achieves good empirical results on a
range of challenging policy learning tasks, outperforming
prior methods. Our analysis also provides a perspective
that unifies policy gradient and policy iteration methods,
and shows them to be special limiting cases of an algorithm
that optimizes a certain objective subject to a trust
region constraint.

In the domain of robotic locomotion, we successfully
learned controllers for swimming, walking and hopping in
a physics simulator, using general purpose neural networks
and minimally informative rewards. To our knowledge,
no prior work has learned controllers from scratch for all
of these tasks, using a generic policy search method and
non-engineered, general-purpose policy representations. In
the game-playing domain, we learned convolutional neural
network policies that used raw images as inputs. This
requires optimizing extremely high-dimensional policies,
and only two prior methods report successful results on this
task.

Since the method we proposed is scalable and has strong
theoretical foundations, we hope that it will serve as a
jumping-off point for future work on training large, rich
function approximators for a range of challenging problems.
At the intersection of the two experimental domains
we explored, there is the possibility of learning robotic control
policies that use vision and raw sensory data as input,
providing a unified scheme for training robotic controllers
that perform both perception and control. The use
of more sophisticated policies, including recurrent policies
with hidden state, could further make it possible to roll state
estimation and control into the same policy in the
partially-observed setting. By combining our method with model
learning, it would also be possible to substantially reduce
its sample complexity, making it applicable to real-world
settings where samples are expensive.

A Proof of Policy Improvement Bound


This proof uses techniques from the proof of Theorem 4.1 in (Kakade & Langford, 2002), adapting them to the more
general setting considered in this paper.

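Lemma 1. Given two policies $\pi, \tilde{\pi}$,

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right] \tag{19}$$

This expectation is taken over trajectories $\tau := (s_0, a_0, s_1, a_1, \dots)$, and the notation $\mathbb{E}_{\tau \sim \tilde{\pi}}[\dots]$ indicates that actions are
sampled from $\tilde{\pi}$ to generate $\tau$.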


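Proof. First note that $A_\pi(s, a) = \mathbb{E}_{s' \sim P(s'|s,a)}\left[r(s) + \gamma V_\pi(s') - V_\pi(s)\right]$. Therefore,

$$\mathbb{E}_{\tau|\tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right] \tag{20}$$

$$= \mathbb{E}_{\tau|\tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\right)\right] \tag{21}$$

$$= \mathbb{E}_{\tau|\tilde{\pi}}\left[-V_\pi(s_0) + \sum_{t=0}^{\infty} \gamma^t r(s_t)\right] \tag{22}$$

$$= -\mathbb{E}_{s_0}\left[V_\pi(s_0)\right] + \mathbb{E}_{\tau|\tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right] \tag{23}$$

$$= -\eta(\pi) + \eta(\tilde{\pi}) \tag{24}$$

Rearranging, the result follows. □

Define $\bar{A}^{\pi,\tilde{\pi}}(s)$ to be the expected advantage of $\tilde{\pi}$ over $\pi$ at state $s$:

$$\bar{A}^{\pi,\tilde{\pi}}(s) = \mathbb{E}_{a \sim \tilde{\pi}(\cdot|s)}\left[A_\pi(s, a)\right]. \tag{25}$$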

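Now Lemma 1 can be written as follows:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^{\pi,\tilde{\pi}}(s_t)\right] \tag{26}$$

Note that $L_\pi$ can be written as

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^{\pi,\tilde{\pi}}(s_t)\right] \tag{27}$$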


The difference in these equations is whether the states are sampled using $\pi$ or $\tilde{\pi}$. To bound the difference between $\eta(\tilde{\pi})$ and
$L_\pi(\tilde{\pi})$, we will bound the difference arising from each timestep. To do this, we first need to introduce a measure of how
much $\pi$ and $\tilde{\pi}$ agree. Specifically, we'll couple the policies, so that they define a joint distribution over pairs of actions.

Definition 1. $(\pi, \tilde{\pi})$ is an $\alpha$-coupled policy pair if it defines a joint distribution $(a, \tilde{a})|s$, such that $P(a \ne \tilde{a}|s) \le \alpha$ for all
$s$. $\pi$ and $\tilde{\pi}$ will denote the marginal distributions of $a$ and $\tilde{a}$, respectively.


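In words, this means that at each state, $(\pi, \tilde{\pi})$ gives us a pair of actions, and these actions differ with probability $\le \alpha$.

Lemma 2. Let $(\pi, \tilde{\pi})$ be an $\alpha$-coupled policy pair. Then

$$\left|\mathbb{E}_{s_t \sim \tilde{\pi}}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] - \mathbb{E}_{s_t \sim \pi}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right]\right| \le 2\epsilon\left(1 - (1 - \alpha)^t\right), \quad \text{where } \epsilon = \max_s \left|\bar{A}^{\pi,\tilde{\pi}}(s)\right| \tag{28}$$

Proof. Consider generating a trajectory using $\tilde{\pi}$, i.e., at each timestep $i$ we sample $(a_i, \tilde{a}_i)|s_i$, and we choose the action $\tilde{a}_i$
and ignore $a_i$. Let $n_t$ denote the number of times that $a_i \ne \tilde{a}_i$ for $i < t$, i.e., the number of times that $\pi$ and $\tilde{\pi}$ disagree
before arriving at state $s_t$.

$$\mathbb{E}_{s_t \sim \tilde{\pi}}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] = P(n_t = 0)\,\mathbb{E}_{s_t \sim \tilde{\pi}|n_t = 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] + P(n_t > 0)\,\mathbb{E}_{s_t \sim \tilde{\pi}|n_t > 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] \tag{29}$$

$P(n_t = 0) = (1 - \alpha)^t$, and $\mathbb{E}_{s_t \sim \tilde{\pi}|n_t = 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] = \mathbb{E}_{s_t \sim \pi|n_t = 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right]$, because $n_t = 0$ indicates that $\pi$ and $\tilde{\pi}$
agreed on all timesteps less than $t$. Therefore, we have

$$\mathbb{E}_{s_t \sim \tilde{\pi}}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] = (1 - \alpha)^t\,\mathbb{E}_{s_t \sim \pi|n_t = 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] + \left(1 - (1 - \alpha)^t\right)\mathbb{E}_{s_t \sim \tilde{\pi}|n_t > 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] \tag{30}$$

Subtracting $\mathbb{E}_{s_t \sim \pi|n_t = 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right]$ from both sides,

$$\mathbb{E}_{s_t \sim \tilde{\pi}}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] - \mathbb{E}_{s_t \sim \pi|n_t = 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] = \left(1 - (1 - \alpha)^t\right)\left(-\mathbb{E}_{s_t \sim \pi|n_t = 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] + \mathbb{E}_{s_t \sim \tilde{\pi}|n_t > 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right]\right) \tag{31}$$

$$\left|\mathbb{E}_{s_t \sim \tilde{\pi}}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] - \mathbb{E}_{s_t \sim \pi|n_t = 0}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right]\right| \le \left(1 - (1 - \alpha)^t\right)(\epsilon + \epsilon) \tag{32}$$

□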


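Now we can sum over time to bound the error of $L_\pi$.

Lemma 3. Suppose $(\pi, \tilde{\pi})$ is an $\alpha$-coupled policy pair. Then

$$\left|\eta(\tilde{\pi}) - L_\pi(\tilde{\pi})\right| \le \frac{2\epsilon\gamma\alpha}{(1 - \gamma)(1 - \gamma(1 - \alpha))} \tag{33}$$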


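Proof.

$$\eta(\tilde{\pi}) - L_\pi(\tilde{\pi}) = \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^{\pi,\tilde{\pi}}(s_t)\right] - \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^{\pi,\tilde{\pi}}(s_t)\right] \tag{34}$$

$$= \sum_{t=0}^{\infty} \gamma^t \left(\mathbb{E}_{s_t \sim \tilde{\pi}}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] - \mathbb{E}_{s_t \sim \pi}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right]\right) \tag{35}$$

$$\left|\eta(\tilde{\pi}) - L_\pi(\tilde{\pi})\right| \le \sum_{t=0}^{\infty} \gamma^t \left|\mathbb{E}_{s_t \sim \tilde{\pi}}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right] - \mathbb{E}_{s_t \sim \pi}\left[\bar{A}^{\pi,\tilde{\pi}}(s_t)\right]\right| \tag{36}$$

$$\le \sum_{t=0}^{\infty} \gamma^t \cdot 2\epsilon \cdot \left(1 - (1 - \alpha)^t\right) \tag{37}$$

$$= \frac{2\epsilon\gamma\alpha}{(1 - \gamma)(1 - \gamma(1 - \alpha))} \tag{38}$$

□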
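The step from (37) to (38) is a difference of two geometric series. The following minimal Python sketch checks the closed form numerically by truncating the sum; the values chosen for ε, γ and α, and the function names, are illustrative assumptions and do not come from the experiments.

```python
def lemma3_partial_sum(eps, gamma, alpha, horizon):
    """Truncated version of sum_t gamma^t * 2*eps * (1 - (1 - alpha)^t), as in (37)."""
    return sum(
        (gamma ** t) * 2.0 * eps * (1.0 - (1.0 - alpha) ** t)
        for t in range(horizon)
    )


def lemma3_closed_form(eps, gamma, alpha):
    """Closed form 2*eps*gamma*alpha / ((1 - gamma)(1 - gamma(1 - alpha))), as in (38)."""
    return 2.0 * eps * gamma * alpha / ((1.0 - gamma) * (1.0 - gamma * (1.0 - alpha)))


# Illustrative values only (assumptions, not taken from the paper).
eps, gamma, alpha = 1.0, 0.9, 0.1
series_value = lemma3_partial_sum(eps, gamma, alpha, horizon=2000)
closed_value = lemma3_closed_form(eps, gamma, alpha)
print(series_value, closed_value)
```

With a horizon of 2000 steps the neglected tail is far below floating-point precision for γ = 0.9, so the two printed numbers should agree closely.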


Last, we need to use the correspondence between total variation divergence and coupled random variables:


Suppose $p_X$ and $p_Y$ are distributions with $D_{TV}(p_X \,\|\, p_Y) = \alpha$. Then there exists a joint distribution $(X, Y)$
whose marginals are $p_X, p_Y$, for which $X = Y$ with probability $1 - \alpha$.


See (Levin et al., 2009), Proposition 4.7.

It follows that if we have two policies $\pi$ and $\tilde{\pi}$ such that $\max_s D_{TV}(\pi(\cdot|s) \,\|\, \tilde{\pi}(\cdot|s)) \le \alpha$, then we can define an $\alpha$-coupled
policy pair $(\pi, \tilde{\pi})$ with appropriate marginals. Theorem 1 follows.
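For discrete distributions, a coupling of this kind can be written down explicitly. The sketch below is one standard construction (a maximal coupling), given here as an illustration under the assumption that $D_{TV}$ is half the L1 distance; it is not claimed to be the construction used in the cited reference, and the function names and example distributions are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance, taken here as half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def maximal_coupling(p, q):
    """Return joint[i][j] = P(X=i, Y=j) with marginals p and q and P(X != Y) = D_TV(p, q)."""
    n = len(p)
    overlap = [min(pi, qi) for pi, qi in zip(p, q)]
    alpha = 1.0 - sum(overlap)  # equals the total variation distance
    joint = [[0.0] * n for _ in range(n)]
    for i in range(n):
        joint[i][i] = overlap[i]  # mass on which X and Y agree
    if alpha > 0.0:
        p_excess = [pi - oi for pi, oi in zip(p, overlap)]
        q_excess = [qi - oi for qi, oi in zip(q, overlap)]
        # Spread the remaining mass off the diagonal; p_excess[i] and
        # q_excess[i] are never both positive, so the diagonal is unchanged.
        for i in range(n):
            for j in range(n):
                joint[i][j] += p_excess[i] * q_excess[j] / alpha
    return joint


p = [0.5, 0.3, 0.2]  # illustrative action distribution pi(.|s)
q = [0.4, 0.4, 0.2]  # illustrative action distribution pi~(.|s)
joint = maximal_coupling(p, q)
prob_disagree = 1.0 - sum(joint[i][i] for i in range(len(p)))
print(total_variation(p, q), prob_disagree)
```

Applying this construction independently at every state gives a joint distribution over action pairs whose disagreement probability at each state equals the total variation divergence there.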
B Perturbation Theory Proof of Policy Improvement Bound

We also provide a different proof of Theorem 1 using perturbation theory. This method makes it possible to provide slightly stronger
bounds.

Theorem 1a. Let $\alpha$ denote the maximum total variation divergence between stochastic policies $\pi$ and $\tilde{\pi}$, and let $L$ denote
the local approximation $L_\pi$, both as defined in the main text. Then

$$\eta(\tilde{\pi}) \ge L(\tilde{\pi}) - \alpha^2 \frac{2\gamma\epsilon}{(1 - \gamma)^2} \tag{39}$$

where

$$\epsilon = \min_s \left\{ \frac{\sum_a \left(\tilde{\pi}(a|s) Q_\pi(s, a) - \pi(a|s) Q_\pi(s, a)\right)}{\sum_a \left|\tilde{\pi}(a|s) - \pi(a|s)\right|} \right\} \tag{40}$$

Note that the $\epsilon$ defined in Equation (40) is less than or equal to the $\epsilon$ defined in Theorem 1. So Theorem 1a is slightly stronger.


Proof. Let $G = (1 + \gamma P_\pi + (\gamma P_\pi)^2 + \dots) = (1 - \gamma P_\pi)^{-1}$, and similarly let $\tilde{G} = (1 + \gamma P_{\tilde{\pi}} + (\gamma P_{\tilde{\pi}})^2 + \dots) = (1 - \gamma P_{\tilde{\pi}})^{-1}$.

We will use the convention that $\rho$ (a density on state space) is a vector and $r$ (a reward function on state space) is a dual
vector (i.e., linear functional on vectors), thus $r\rho$ is a scalar meaning the expected reward under density $\rho$. Note that
$\eta(\pi) = rG\rho_0$, and $\eta(\tilde{\pi}) = r\tilde{G}\rho_0$. Let $\Delta = P_{\tilde{\pi}} - P_\pi$. We want to bound $\eta(\tilde{\pi}) - \eta(\pi) = r(\tilde{G} - G)\rho_0$. We start with some
standard perturbation theory manipulations.

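$$G^{-1} - \tilde{G}^{-1} = (1 - \gamma P_\pi) - (1 - \gamma P_{\tilde{\pi}}) = \gamma\Delta. \tag{41}$$

Left multiply by $G$ and right multiply by $\tilde{G}$.

$$\tilde{G} - G = \gamma G \Delta \tilde{G}$$

$$\tilde{G} = G + \gamma G \Delta \tilde{G} \tag{42}$$

Substituting the right-hand side into $\tilde{G}$ gives

$$\tilde{G} = G + \gamma G \Delta G + \gamma^2 G \Delta G \Delta \tilde{G} \tag{43}$$

So we have

$$\eta(\tilde{\pi}) - \eta(\pi) = r(\tilde{G} - G)\rho_0 = \gamma r G \Delta G \rho_0 + \gamma^2 r G \Delta G \Delta \tilde{G} \rho_0 \tag{44}$$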
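The identities (42) and (43) hold for any pair of transition operators for which the inverses exist, and they are easy to check numerically. The following sketch does so for a two-state chain; the transition matrices, the discount, and the helper names are illustrative assumptions, with columns of each matrix summing to one so that the matrix acts on density vectors from the left.

```python
def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def scale(c, a):
    return [[c * a[i][j] for j in range(2)] for i in range(2)]


def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def resolvent(P, gamma):
    """G = (1 - gamma * P)^{-1}."""
    identity = [[1.0, 0.0], [0.0, 1.0]]
    return inverse(add(identity, scale(-gamma, P)))


def max_abs_diff(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))


gamma = 0.9
P_pi = [[0.8, 0.3], [0.2, 0.7]]      # illustrative; P[s_next][s], columns sum to 1
P_tilde = [[0.6, 0.5], [0.4, 0.5]]   # illustrative transition matrix under pi~

G = resolvent(P_pi, gamma)
G_tilde = resolvent(P_tilde, gamma)
Delta = add(P_tilde, scale(-1.0, P_pi))

G_Delta = matmul(G, Delta)
exact_diff = add(G_tilde, scale(-1.0, G))                       # G~ - G
rhs_42 = scale(gamma, matmul(G_Delta, G_tilde))                 # gamma G Delta G~
first_order = scale(gamma, matmul(G_Delta, G))                  # gamma G Delta G
second_order = scale(gamma ** 2, matmul(matmul(G_Delta, G_Delta), G_tilde))
rhs_43 = add(first_order, second_order)

print(max_abs_diff(exact_diff, rhs_42), max_abs_diff(exact_diff, rhs_43))
```

Both printed discrepancies should be at the level of floating-point rounding, since the expansion is exact rather than a truncated series.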


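Let us first consider the leading term $\gamma r G \Delta G \rho_0$. Note that $rG = v$, i.e., the infinite-horizon state-value function. Also
note that $G\rho_0 = \rho_\pi$. Thus we can write $\gamma r G \Delta G \rho_0 = \gamma v \Delta \rho_\pi$. We will show that this expression equals the expected
advantage $L_\pi(\tilde{\pi}) - L_\pi(\pi)$.

$$L_\pi(\tilde{\pi}) - L_\pi(\pi) = \sum_s \rho_\pi(s) \sum_a \left(\tilde{\pi}(a|s) - \pi(a|s)\right) A_\pi(s, a)$$

$$= \sum_s \rho_\pi(s) \sum_a \left(\tilde{\pi}(a|s) - \pi(a|s)\right) \left[r(s) + \sum_{s'} p(s'|s, a)\,\gamma v(s') - v(s)\right]$$

$$= \sum_s \rho_\pi(s) \sum_{s'} \sum_a \left(\tilde{\pi}(a|s) - \pi(a|s)\right) p(s'|s, a)\,\gamma v(s')$$

$$= \sum_s \rho_\pi(s) \sum_{s'} \left(p_{\tilde{\pi}}(s'|s) - p_\pi(s'|s)\right) \gamma v(s')$$

$$= \gamma v \Delta \rho_\pi \tag{45}$$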


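Next let us bound the $O(\Delta^2)$ term $\gamma^2 r G \Delta G \Delta \tilde{G} \rho$. First we consider the product $\gamma r G \Delta = \gamma v \Delta$. Consider the component
$s$ of this dual vector.

$$(\gamma v \Delta)_s = \sum_a \left(\tilde{\pi}(a|s) - \pi(a|s)\right) Q_\pi(s, a)$$

$$= \sum_a \left|\tilde{\pi}(a|s) - \pi(a|s)\right| \cdot \frac{\sum_a \left(\tilde{\pi}(a|s) - \pi(a|s)\right) Q_\pi(s, a)}{\sum_a \left|\tilde{\pi}(a|s) - \pi(a|s)\right|}$$

$$\le \alpha\epsilon \tag{46}$$

We bound the other portion $G \Delta \tilde{G} \rho$ using the $\ell_1$ operator norm

$$\|A\|_1 = \sup_\rho \left\{ \frac{\|A\rho\|_1}{\|\rho\|_1} \right\} \tag{47}$$

where we have that $\|G\|_1 = \|\tilde{G}\|_1 = 1/(1 - \gamma)$ and $\|\Delta\|_1 = 2\alpha$. That gives

$$\|G \Delta \tilde{G} \rho\|_1 \le \|G\|_1 \|\Delta\|_1 \|\tilde{G}\|_1 \|\rho\|_1 = \frac{1}{1 - \gamma} \cdot 2\alpha \cdot \frac{1}{1 - \gamma} \cdot 1 \tag{48}$$

So we have that

$$\gamma^2 \left|r G \Delta G \Delta \tilde{G} \rho\right| \le \gamma \|\gamma r G \Delta\|_\infty \|G \Delta \tilde{G} \rho\|_1 \le \gamma \cdot \alpha\epsilon \cdot \frac{2\alpha}{(1 - \gamma)^2} = \alpha^2 \frac{2\gamma\epsilon}{(1 - \gamma)^2} \tag{49}$$

□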


C Efficiently Solving the Trust-Region Constrained Optimization Problem 

This section describes how to efficiently approximately solve the following constrained optimization problem, which we
must solve at each iteration of TRPO:


$$\text{maximize } L(\theta) \quad \text{subject to } \bar{D}_{KL}(\theta_{old}, \theta) \le \delta. \tag{50}$$

The method we will describe involves two steps: (1) compute a search direction, using a linear approximation to the objective
and a quadratic approximation to the constraint; and (2) perform a line search in that direction, ensuring that we improve the
nonlinear objective while satisfying the nonlinear constraint.


The search direction is computed by approximately solving the equation $Ax = g$, where $A$ is the Fisher information
matrix, i.e., the quadratic approximation to the KL divergence constraint: $\bar{D}_{KL}(\theta_{old}, \theta) \approx \frac{1}{2}(\theta - \theta_{old})^T A (\theta - \theta_{old})$, where
$A_{ij} = \frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j} \bar{D}_{KL}(\theta_{old}, \theta)$. In large-scale problems, it is prohibitively costly (with respect to computation and memory) to
form the full matrix $A$ (or $A^{-1}$). However, the conjugate gradient algorithm allows us to approximately solve the equation
$Ax = g$ without forming this full matrix, when we merely have access to a function that computes matrix-vector products
$y \to Ay$. Appendix C.1 describes the most efficient way to compute matrix-vector products with the Fisher information
matrix. For additional exposition on the use of Hessian-vector products for optimizing neural network objectives, see
(Martens & Sutskever, 2012) and (Pascanu & Bengio, 2013).
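The matrix-free use of conjugate gradient described above can be sketched in a few lines of plain Python. This is the textbook algorithm, not the authors' implementation; the function names, the tolerance, and the small test matrix are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, g, iters=10, tol=1e-10):
    """Approximately solve A x = g using only the product function y -> A y.

    matvec must implement multiplication by a symmetric positive-definite
    matrix; the matrix itself is never formed here.
    """
    x = [0.0] * len(g)
    r = list(g)            # residual g - A x for the initial guess x = 0
    p = list(g)            # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Ap = matvec(p)
        step = rr / dot(p, Ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def example_matvec(v):
    """Stand-in for the Fisher-vector product: multiplies by [[4, 1], [1, 3]]."""
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]


direction = conjugate_gradient(example_matvec, [1.0, 2.0])
print(direction)
```

In the setting of this appendix, `example_matvec` would be replaced by a routine that returns the averaged Fisher-vector product, and `g` by the gradient of the surrogate objective.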


Having computed the search direction $s \approx A^{-1}g$, we next need to compute the maximal step length $\beta$ such that $\theta + \beta s$
will satisfy the KL divergence constraint. To do this, let $\delta = \bar{D}_{KL} \approx \frac{1}{2}(\beta s)^T A (\beta s) = \frac{1}{2}\beta^2 s^T A s$. From this, we obtain
$\beta = \sqrt{2\delta / s^T A s}$, where $\delta$ is the desired KL divergence. The term $s^T A s$ can be computed through a single Hessian vector
product, and it is also an intermediate result produced by the conjugate gradient algorithm.


Last, we use a line search to ensure improvement of the surrogate objective and satisfaction of the KL divergence constraint,
both of which are nonlinear in the parameter vector $\theta$ (and thus depart from the linear and quadratic approximations used
to compute the step). We perform the line search on the objective $L_{\theta_{old}}(\theta) - \mathcal{X}[\bar{D}_{KL}(\theta_{old}, \theta) \le \delta]$, where $\mathcal{X}[\dots]$ equals
zero when its argument is true and $+\infty$ when it is false. Starting with the maximal value of the step length $\beta$ computed
in the previous paragraph, we shrink $\beta$ exponentially until the objective improves. Without this line search, the algorithm
occasionally computes large steps that cause a catastrophic degradation of performance.
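The step-length rule and the line search can be sketched as follows. This is a minimal illustration, assuming a shrink factor of 0.5 and a cap of 10 backtracking steps (the text only says that β is shrunk exponentially); the toy surrogate and KL functions are hypothetical stand-ins, not the quantities used in the experiments.

```python
import math


def max_step_length(s, As, delta):
    """beta = sqrt(2 * delta / (s^T A s)), given s and the product A s."""
    sAs = sum(si * ai for si, ai in zip(s, As))
    return math.sqrt(2.0 * delta / sAs)


def backtracking_line_search(surrogate, kl, theta_old, s, beta, delta,
                             shrink=0.5, max_backtracks=10):
    """Shrink beta until the KL constraint holds and the surrogate improves.

    The infinite penalty for a violated constraint is implemented by
    rejecting any candidate with kl(theta_old, theta) > delta.
    """
    baseline = surrogate(theta_old)
    for _ in range(max_backtracks):
        theta = [t + beta * si for t, si in zip(theta_old, s)]
        if kl(theta_old, theta) <= delta and surrogate(theta) > baseline:
            return theta
        beta *= shrink
    return list(theta_old)  # no acceptable step found; keep the old parameters


# Toy problem (assumption): concave surrogate, and a KL-like function whose
# quadratic approximation has A equal to the identity.
def toy_surrogate(theta):
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 0.5) ** 2


def toy_kl(theta_old, theta):
    d = [a - b for a, b in zip(theta, theta_old)]
    return 0.5 * sum(x * x for x in d) + sum(x ** 4 for x in d)


theta_old = [0.0, 0.0]
s = [2.0, -1.0]   # gradient of toy_surrogate at theta_old; equals A^{-1} g since A = I
delta = 0.01
beta = max_step_length(s, s, delta)   # A s = s for the identity matrix
theta_new = backtracking_line_search(toy_surrogate, toy_kl, theta_old, s, beta, delta)
print(beta, theta_new)
```

In this toy problem the full step slightly violates the nonlinear constraint because of the quartic term, so the first candidate is rejected and the halved step is accepted.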
C.1 Computing the Fisher-Vector Product


Here we will describe how to compute the matrix-vector product between the averaged Fisher information matrix and
arbitrary vectors. This matrix-vector product enables us to perform the conjugate gradient algorithm. Suppose that the
parameterized policy maps from the input $x$ to "distribution parameter" vector $\mu_\theta(x)$, which parameterizes the distribution
$\pi(u|x)$. Now the KL divergence for a given input $x$ can be written as follows:

$$D_{KL}(\pi_{\theta_{old}}(\cdot|x) \,\|\, \pi_\theta(\cdot|x)) = \mathrm{kl}(\mu_\theta(x), \mu_{old}) \tag{51}$$


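where kl is the KL divergence between the distributions corresponding to the two mean parameter vectors. Differentiating
kl twice with respect to $\theta$, we obtain

$$\frac{\partial \mu_a(x)}{\partial \theta_i} \frac{\partial \mu_b(x)}{\partial \theta_j} \mathrm{kl}''_{ab}(\mu_\theta(x), \mu_{old}) + \frac{\partial^2 \mu_a(x)}{\partial \theta_i \partial \theta_j} \mathrm{kl}'_a(\mu_\theta(x), \mu_{old}) \tag{52}$$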


where the primes (') indicate differentiation with respect to the first argument, and there is an implied summation over
indices $a, b$. The second term vanishes when evaluated at $\theta = \theta_{old}$, because the KL divergence attains its minimum there and
so $\mathrm{kl}'_a$ is zero, leaving just the first term. Let $J := \frac{\partial \mu_a(x)}{\partial \theta_i}$ (the Jacobian), then the Fisher
information matrix can be written in matrix form as $J^T M J$, where $M = \mathrm{kl}''_{ab}(\mu_\theta(x), \mu_{old})$ is the Fisher information
matrix of the distribution in terms of the mean parameter $\mu$ (as opposed to the parameter $\theta$). This has a simple form for
most parameterized distributions of interest.

The Fisher-vector product can now be written as a function $y \to J^T M J y$. Multiplication by $J^T$ and $J$ can be performed by
most automatic differentiation and neural network packages (multiplication by $J^T$ is the well-known backprop operation),
and the operation for multiplication by $M$ can be derived for the distribution of interest. Note that this Fisher-vector product
is straightforward to average over a set of datapoints, i.e., inputs $x$ to $\mu$.
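As a concrete instance of the product $J^T M J y$, consider the simplest case where the Jacobian is available in closed form. The sketch below assumes a Gaussian policy whose mean is a linear map `W x` and whose standard deviation is fixed and not part of the parameters, so that $M = \mathrm{diag}(1/\sigma^2)$; this simplified model and all names in it are illustrative assumptions, whereas for a neural network the two Jacobian products would come from an automatic differentiation package.

```python
def fisher_vector_product(inputs, stdev, y):
    """Average of J^T M J y over inputs, for mean = W x with fixed stdev.

    y has the same shape as W: one row per action dimension, one column
    per input feature. The Fisher matrix itself is never formed.
    """
    n_act, n_in = len(y), len(y[0])
    out = [[0.0] * n_in for _ in range(n_act)]
    for x in inputs:
        # J y: directional derivative of the mean along y.
        Jy = [sum(y[a][i] * x[i] for i in range(n_in)) for a in range(n_act)]
        # M (J y): Fisher matrix of a fixed-stdev Gaussian w.r.t. its mean.
        MJy = [Jy[a] / stdev[a] ** 2 for a in range(n_act)]
        # J^T (M J y): backpropagation through the linear map.
        for a in range(n_act):
            for i in range(n_in):
                out[a][i] += MJy[a] * x[i] / len(inputs)
    return out


# One input x = (1, 2), one action dimension, unit standard deviation.
fvp_unit = fisher_vector_product([[1.0, 2.0]], [1.0], [[1.0, 0.0]])
# Same input with standard deviation 2, which scales the result by 1/4.
fvp_wide = fisher_vector_product([[1.0, 2.0]], [2.0], [[1.0, 0.0]])
print(fvp_unit, fvp_wide)
```

For the single input above the Fisher matrix is $x x^T / \sigma^2$, so the product with $y = (1, 0)$ is $(1, 2)/\sigma^2$, which is what the two calls return.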


One could alternatively use a generic method for calculating Hessian-vector products using reverse mode automatic
differentiation ((Wright & Nocedal, 1999), chapter 8), computing the Hessian of $\bar{D}_{KL}$ with respect to $\theta$. This method would be
slightly less efficient as it does not exploit the fact that the second derivatives of $\mu(x)$ (i.e., the second term in Equation (52))
can be ignored, but may be substantially easier to implement.

We have described a procedure for computing the Fisher-vector product $y \to Ay$, where the Fisher information matrix is
averaged over a set of inputs to the function $\mu$. Computing the Fisher-vector product is typically about as expensive as
computing the gradient of an objective that depends on $\mu(x)$ (Wright & Nocedal, 1999). Furthermore, we need to compute
$k$ of these Fisher-vector products per gradient, where $k$ is the number of iterations of the conjugate gradient algorithm we
perform. We found $k = 10$ to be quite effective, and using higher $k$ did not result in faster policy improvement. Hence, a
naive implementation would spend more than 90% of the computational effort on these Fisher-vector products. However,
we can greatly reduce this burden by subsampling the data for the computation of Fisher-vector product. Since the Fisher
information matrix merely acts as a metric, it can be computed on a subset of the data without severely degrading the
quality of the final step. Hence, we can compute it on 10% of the data, and the total cost of Hessian-vector products will
be about the same as computing the gradient. With this optimization, the computation of a natural gradient step $A^{-1}g$ does
not incur a significant extra computational cost beyond computing the gradient $g$.


D Approximating Factored Policies with Neural Networks 

The policy, which is a conditional probability distribution $\pi_\theta(a|s)$, can be parameterized with a neural network. The most
straightforward way to do so is to have the neural network map (deterministically) from the state vector $s$ to a vector $\mu$ that
specifies a distribution over action space. Then we can compute the likelihood $p(a|\mu)$ and sample $a \sim p(a|\mu)$.

For our experiments with continuous state and action spaces, we used a Gaussian distribution, where the covariance matrix
was diagonal and independent of the state. A neural network with several fully-connected (dense) layers maps from the
input features to the mean of a Gaussian distribution. A separate set of parameters specifies the log standard deviation of
each element. More concretely, the parameters include a set of weights and biases for the neural network computing the
mean, $\{W_i, b_i\}_{i=1}^{L}$, and a vector $r$ (log standard deviation) with the same dimension as $a$. Then, the policy is defined by
the normal distribution $\mathcal{N}\left(\text{mean} = \text{NeuralNet}\left(s; \{W_i, b_i\}_{i=1}^{L}\right), \text{stdev} = \exp(r)\right)$. Here, $\mu = [\text{mean}, \text{stdev}]$.
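Given the distribution parameters, the likelihood and sampling steps for this diagonal Gaussian can be sketched as below. The network that produces the mean is left out, since its details are not specified here; the function names and example values are assumptions for illustration.

```python
import math
import random


def gaussian_params(mean, log_std):
    """Assemble mu = [mean, stdev] from the network output and the vector r."""
    return list(mean), [math.exp(r) for r in log_std]


def log_likelihood(action, mean, stdev):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, stdev)
    )


def sample_action(mean, stdev, rng):
    """Draw one action; each dimension is sampled independently."""
    return [rng.gauss(m, s) for m, s in zip(mean, stdev)]


# Hypothetical mean for a 2-dimensional action, with r = 0 (unit stdev).
mean, stdev = gaussian_params([0.0, 1.0], [0.0, 0.0])
rng = random.Random(0)
action = sample_action(mean, stdev, rng)
logp_at_mean = log_likelihood(mean, mean, stdev)
print(action, logp_at_mean)
```

Because the log standard deviation `r` is a free parameter vector rather than a network output, the exploration noise in this parameterization does not depend on the state.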

For the experiments with discrete actions (Atari), we use a factored discrete action space, where each factor is parameterized
as a categorical distribution. That is, the action consists of a tuple $(a_1, a_2, \dots, a_K)$ of integers $a_k \in \{1, 2, \dots, N_k\}$,
and each of these components is assumed to have a categorical distribution, which is specified by a vector $\mu_k =
[p_1, p_2, \dots, p_{N_k}]$. Hence, $\mu$ is defined to be the concatenation of the factors' parameters: $\mu = [\mu_1, \mu_2, \dots, \mu_K]$ and
has dimension $\dim \mu = \sum_{k=1}^{K} N_k$. The components of $\mu$ are computed by applying a neural network to the input $s$
and then applying the softmax operator to each slice, yielding normalized probabilities for each factor.
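The per-slice softmax can be sketched as follows. The network outputs, factor sizes, and function names are illustrative assumptions, and actions are indexed from 0 in the code rather than from 1 as in the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one slice."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def factored_categorical(outputs, factor_sizes):
    """Split network outputs into slices of sizes N_1..N_K and normalize each."""
    if len(outputs) != sum(factor_sizes):
        raise ValueError("output dimension must equal the sum of factor sizes")
    mu, start = [], 0
    for n in factor_sizes:
        mu.append(softmax(outputs[start:start + n]))
        start += n
    return mu


def log_prob(action, mu):
    """Log-probability of a tuple action; factors are independent."""
    return sum(math.log(mu_k[a_k]) for a_k, mu_k in zip(action, mu))


# Hypothetical network output for K = 2 factors with N_1 = 2 and N_2 = 3.
mu = factored_categorical([0.0, 0.0, 1.0, 1.0, 1.0], [2, 3])
lp = log_prob((0, 2), mu)
print(mu, lp)
```

Here each factor is uniform, so the joint probability of any tuple is 1/6 and the log-probability is the negative logarithm of 6.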

E Experiment Parameters 



                                    Swimmer   Hopper   Walker
State space dim.                         10       12       20
Control space dim.                        2        3        6
Total num. policy params                364     4806     8206
Sim. steps per iter.                    50K       1M       1M
Policy iter.                            200      200      200
Stepsize ($\bar{D}_{KL}$)              0.01     0.01     0.01
Hidden layer size                        30       50       50
Discount ($\gamma$)                    0.99     0.99     0.99
Vine: rollout length                     50      100      100
Vine: rollouts per state                  4        4        4
Vine: Q-values per batch                500     2500     2500
Vine: num. rollouts for sampling         16       16       16
Vine: len. rollouts for sampling       1000     1000     1000
Vine: computation time (minutes)          2       14       40
SP: num. path                            50     1000    10000
SP: path len.                          1000     1000     1000
SP: computation time                      5       35      100


Table 2. Parameters for continuous control tasks, vine and single path (SP) algorithms.



                               All games
Total num. policy params           33500
Vine: Sim. steps per iter.          400K
SP: Sim. steps per iter.            100K
Policy iter.                         500
Stepsize                            0.01
Discount ($\gamma$)                 0.99
Vine: rollouts per state             ≈ 4
Vine: computation time          ≈ 30 hrs
SP: computation time            ≈ 30 hrs


Table 3. Parameters used for Atari domain.

F Learning Curves for the Atari Domain

Figure 5. Learning curves for the Atari domain. For historical reasons, the plots show cost = negative reward.